1. Executive summary

The aim of this report is to verify factors influencing final outcome of COVID19 disease. Correlations counted in Chapter 5 verify the theory about influence of 3 blood parameters mentioned in the article. Chapter 6 describes a trial of creating a prediction model basing on mentioned parameters.

2. R libraries used in this report

In this report following R libraries have been used:

library(openxlsx)
library(ggplot2)
library(dplyr)
library(plotly)
library(corrplot)
library(caret)
library(summarytools)
library(backports)
library(skimr)

3. Extracting the data

3.1. Loading data into RawDataDF dataframe

FileName <- "wuhan_blood_sample_data_Jan_Feb_2020.xlsx"
RawDataDF <- data.frame(openxlsx::read.xlsx(FileName, sheet = "Sheet1", colNames = TRUE, fillMergedCells = TRUE) )

3.2. The source file has following attributes:

knitr::kable(file.info(FileName))
size isdir mode mtime ctime atime exe
wuhan_blood_sample_data_Jan_Feb_2020.xlsx 634526 FALSE 666 2021-05-10 15:31:09 2021-05-13 11:07:23 2021-06-05 22:58:14 no

4. Cleaning the data

4.1. Adjusting date/time columns

ClearedDataDF <- mutate(RawDataDF, RE_DATE = openxlsx::convertToDateTime(RE_DATE), Admission.time=openxlsx::convertToDateTime(Admission.time),Discharge.time = openxlsx::convertToDateTime(Discharge.time))

4.2. Grouping the data by patient

GroupedByPatientDF <- aggregate(ClearedDataDF[, -c(1, 2, 5, 6)], by = list(ClearedDataDF$PATIENT_ID), FUN = mean, na.rm = TRUE) %>% rename(PATIENT_ID = Group.1) %>% arrange(PATIENT_ID)
GroupedByPatientDF["outcome"][GroupedByPatientDF["outcome"] == 1] <- "died"
GroupedByPatientDF["outcome"][GroupedByPatientDF["outcome"] == 0] <- "survived"

4.3. Characteristics of input dataset

skim(GroupedByPatientDF)
Data summary
Name GroupedByPatientDF
Number of rows 375
Number of columns 78
_______________________
Column type frequency:
character 1
numeric 77
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
outcome 0 1 4 8 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
PATIENT_ID 0 1.00 188.00 108.40 1.00 94.50 188.00 281.50 375.00 ▇▇▇▇▇
age 0 1.00 58.83 16.46 18.00 46.00 62.00 70.00 95.00 ▂▃▇▇▂
gender 0 1.00 1.40 0.49 1.00 1.00 1.00 2.00 2.00 ▇▁▁▁▆
Hypersensitive.cardiac.troponinI 88 0.77 765.96 4273.16 1.90 2.56 13.00 75.97 50000.00 ▇▁▁▁▁
hemoglobin 19 0.95 125.22 18.79 61.80 113.88 125.76 138.00 178.00 ▁▂▇▅▁
Serum.chloride 21 0.94 102.41 6.58 74.60 99.00 101.80 104.58 140.20 ▁▆▇▁▁
Prothrombin.time 23 0.94 15.61 5.08 11.50 13.50 14.31 16.02 83.35 ▇▁▁▁▁
procalcitonin 62 0.83 0.88 3.21 0.02 0.04 0.10 0.44 29.24 ▇▁▁▁▁
eosinophils… 19 0.95 0.68 0.94 0.00 0.00 0.25 0.98 5.55 ▇▂▁▁▁
Interleukin.2.receptor 159 0.58 934.60 807.08 65.50 465.25 693.50 1172.50 7500.00 ▇▁▁▁▁
Alkaline.phosphatase 19 0.95 82.27 48.51 17.00 54.62 69.29 92.40 620.00 ▇▁▁▁▁
albumin 19 0.95 33.12 5.45 18.55 29.10 33.15 37.11 46.30 ▂▆▇▇▂
basophil… 19 0.95 0.22 0.19 0.00 0.10 0.20 0.30 1.70 ▇▂▁▁▁
Interleukin.10 160 0.57 13.92 51.46 5.00 5.00 6.20 12.25 750.00 ▇▁▁▁▁
Total.bilirubin 19 0.95 15.33 27.45 2.75 7.40 10.59 14.80 390.85 ▇▁▁▁▁
Platelet.count 19 0.95 195.17 93.61 -1.00 131.01 197.00 249.84 554.00 ▃▇▅▁▁
monocytes… 19 0.95 6.62 3.93 0.50 3.59 6.48 9.00 35.20 ▇▅▁▁▁
antithrombin 173 0.54 87.82 16.14 42.00 77.00 88.00 98.00 130.00 ▁▅▇▅▂
Interleukin.8 159 0.58 67.05 312.97 5.00 9.20 16.60 35.40 3409.00 ▇▁▁▁▁
indirect.bilirubin 20 0.95 6.73 6.82 1.10 4.03 5.50 7.60 102.40 ▇▁▁▁▁
Red.blood.cell.distribution.width 25 0.93 13.00 1.64 10.70 12.00 12.54 13.60 25.60 ▇▂▁▁▁
neutrophils… 19 0.95 75.75 16.00 1.80 63.29 76.40 90.70 98.80 ▁▁▃▆▇
total.protein 19 0.95 66.33 5.99 47.20 62.54 66.72 70.87 81.50 ▁▂▇▇▁
Quantification.of.Treponema.pallidum.antibodies 100 0.73 0.13 0.78 0.02 0.04 0.05 0.07 11.95 ▇▁▁▁▁
Prothrombin.activity 23 0.94 82.57 19.75 25.00 69.00 86.00 96.00 142.00 ▁▃▇▅▁
HBsAg 100 0.73 8.43 43.23 0.00 0.00 0.01 0.01 250.00 ▇▁▁▁▁
mean.corpuscular.volume 19 0.95 89.90 6.10 61.95 86.70 89.94 93.11 116.15 ▁▁▇▂▁
hematocrit 19 0.95 36.86 5.06 17.56 33.88 36.77 39.88 52.30 ▁▂▇▅▁
White.blood.cell.count 18 0.95 13.14 26.35 0.72 5.36 7.78 13.15 370.93 ▇▁▁▁▁
Tumor.necrosis.factor.U.03B1. 159 0.58 11.62 10.99 4.00 6.90 8.43 11.60 103.55 ▇▁▁▁▁
mean.corpuscular.hemoglobin.concentration 19 0.95 343.66 16.13 299.00 335.25 343.33 351.00 488.00 ▃▇▁▁▁
fibrinogen 77 0.79 4.44 1.55 0.55 3.40 4.36 5.41 8.95 ▂▆▇▅▁
Interleukin.1ß 159 0.58 6.57 7.06 5.00 5.00 5.00 5.00 88.50 ▇▁▁▁▁
Urea 19 0.95 8.62 8.11 1.70 3.85 5.43 9.95 58.10 ▇▁▁▁▁
lymphocyte.count 19 0.95 1.11 2.29 0.03 0.55 0.91 1.37 43.06 ▇▁▁▁▁
PH.value 143 0.62 6.45 0.63 5.00 6.00 6.50 6.96 7.57 ▃▇▇▇▅
Red.blood.cell.count 18 0.95 7.91 16.97 1.85 3.87 4.44 5.38 252.25 ▇▁▁▁▁
Eosinophil.count 19 0.95 0.04 0.05 0.00 0.00 0.02 0.06 0.38 ▇▂▁▁▁
Corrected.calcium 22 0.94 2.34 0.12 2.07 2.26 2.36 2.43 2.79 ▃▆▇▁▁
Serum.potassium 21 0.94 4.41 0.63 3.13 4.02 4.34 4.67 6.86 ▂▇▃▁▁
glucose 24 0.94 8.43 4.18 1.00 5.63 6.99 9.53 28.98 ▇▇▂▁▁
neutrophils.count 19 0.95 7.38 5.38 0.32 3.30 5.47 10.70 26.80 ▇▃▃▁▁
Direct.bilirubin 19 0.95 8.63 21.39 1.60 3.20 4.62 7.18 288.45 ▇▁▁▁▁
Mean.platelet.volume 29 0.92 10.89 0.99 8.50 10.15 10.80 11.45 14.20 ▂▇▇▂▁
ferritin 162 0.57 1488.11 3860.07 17.80 409.80 753.00 1418.30 50000.00 ▇▁▁▁▁
RBC.distribution.width.SD 25 0.93 41.96 6.02 31.30 38.42 40.62 43.80 100.10 ▇▂▁▁▁
Thrombin.time 77 0.79 17.63 5.31 13.60 15.80 16.64 18.00 92.75 ▇▁▁▁▁
X…lymphocyte 19 0.95 16.69 12.26 0.15 5.11 14.70 26.33 52.35 ▇▅▅▂▁
HCV.antibody.quantification 100 0.73 0.11 0.22 0.02 0.04 0.06 0.09 2.09 ▇▁▁▁▁
D.D.dimer 33 0.91 6.20 8.06 0.21 0.51 1.35 12.03 40.50 ▇▁▂▁▁
Total.cholesterol 19 0.95 3.67 0.87 1.00 2.98 3.64 4.23 6.16 ▁▅▇▅▁
aspartate.aminotransferase 19 0.95 47.99 93.92 7.67 20.50 29.00 43.00 959.50 ▇▁▁▁▁
Uric.acid 19 0.95 281.11 135.76 84.20 200.38 245.73 328.33 993.00 ▇▅▁▁▁
HCO3. 19 0.95 22.86 3.67 10.00 21.25 23.52 25.38 30.07 ▁▂▃▇▃
calcium 21 0.94 2.10 0.13 1.78 2.01 2.11 2.20 2.58 ▂▇▇▂▁
Amino.terminal.brain.natriuretic.peptide.precursor.NT.proBNP. 108 0.71 2688.23 7776.71 5.00 66.25 318.00 1548.75 70000.00 ▇▁▁▁▁
Lactate.dehydrogenase 19 0.95 462.10 343.36 116.00 224.00 306.43 610.25 1867.00 ▇▂▁▁▁
platelet.large.cell.ratio 29 0.92 31.62 7.88 11.20 25.77 30.91 36.38 58.60 ▂▇▇▂▁
Interleukin.6 157 0.58 107.40 438.82 1.50 5.97 21.09 61.94 5000.00 ▇▁▁▁▁
Fibrin.degradation.products 173 0.54 46.83 58.14 4.00 4.00 5.70 94.19 190.80 ▇▁▁▂▁
monocytes.count 19 0.95 0.56 1.95 0.03 0.32 0.42 0.54 36.94 ▇▁▁▁▁
PLT.distribution.width 29 0.92 13.00 2.58 8.20 11.15 12.60 14.16 24.20 ▅▇▂▁▁
globulin 19 0.95 33.20 4.88 18.50 30.29 32.98 36.25 48.20 ▁▃▇▃▁
X.U.03B3..glutamyl.transpeptidase 19 0.95 50.78 63.92 7.00 21.31 33.20 52.62 732.00 ▇▁▁▁▁
International.standard.ratio 23 0.94 1.23 0.46 0.84 1.03 1.10 1.28 5.45 ▇▁▁▁▁
basophil.count… 19 0.95 0.02 0.01 0.00 0.01 0.01 0.02 0.10 ▇▂▁▁▁
X2019.nCoV.nucleic.acid.detection 157 0.58 -1.00 0.00 -1.00 -1.00 -1.00 -1.00 -1.00 ▁▁▇▁▁
mean.corpuscular.hemoglobin 19 0.95 30.91 2.78 20.80 29.75 30.87 32.10 50.80 ▁▇▂▁▁
Activation.of.partial.thromboplastin.time 77 0.79 40.91 7.47 21.80 36.41 39.50 43.77 102.75 ▆▇▁▁▁
High.sensitivity.C.reactive.protein 22 0.94 70.41 71.10 0.10 10.20 49.00 118.10 320.00 ▇▃▂▁▁
HIV.antibody.quantification 101 0.73 0.10 0.04 0.05 0.07 0.09 0.11 0.27 ▇▃▂▁▁
serum.sodium 21 0.94 140.74 6.11 119.10 138.01 139.94 142.55 179.60 ▁▇▂▁▁
thrombocytocrit 29 0.92 0.21 0.08 0.01 0.16 0.22 0.26 0.51 ▂▇▇▂▁
ESR 87 0.77 33.87 24.12 1.00 15.00 30.00 47.00 106.00 ▇▆▃▂▁
glutamic.pyruvic.transaminase 19 0.95 38.71 75.21 5.00 17.48 25.00 40.62 1061.00 ▇▁▁▁▁
eGFR 19 0.95 84.04 29.77 2.15 67.57 89.38 104.71 215.45 ▂▆▇▁▁
creatinine 19 0.95 106.30 136.83 12.50 58.00 75.49 96.19 1430.00 ▇▁▁▁▁

5. Correlations

5.1. Introduction

Below chapter is a trial of finding correlations between attributes taken out of bloodtests and final outcome (if patient died or survived).

For every further correlation analysis this customized function was applied. Please note the threshold parameter that limits the output to significancy at desired level.

  my_correlation <- function (selected_parameters, reference = "outcome", threshold = 0) {

  df_cor <- GroupedByPatientDF[ , selected_parameters] %>% mutate_if(is.character, as.factor)%>% mutate_if(is.factor, as.numeric)
  corr <- cor(df_cor, use= "pairwise.complete.obs", method = "spearman")
  corr[lower.tri(corr,diag=TRUE)] <- NA
  corr <- as.data.frame(as.table(corr))%>%na.omit(corr)%>%subset(Var1 == reference)
  corr <- subset(corr, abs(Freq) >= threshold)
  corr <- corr[order(-abs(corr$Freq)),]
  mtx_corr <<- reshape2::acast(corr, Var1~Var2, value.var="Freq")
  
  corrplot(mtx_corr, is.corr=FALSE, tl.col="black", na.label=" ", method = "number", cl.pos = "n", tl.cex = 0.85, number.cex = 0.8)
  
  }

5.2. Most significant attributes

In order to distinguish attributes that may have the biggest impact on outcome a threshold level of 0.65 has been applied.

  my_correlation(selected_parameters = 1:78, threshold = 0.65)

  table <- as.data.frame(t(mtx_corr)) %>% arrange(outcome)
  knitr::kable(table)
outcome
Lactate.dehydrogenase -0.8151941
Fibrin.degradation.products -0.7926060
neutrophils… -0.7821134
procalcitonin -0.7783379
High.sensitivity.C.reactive.protein -0.7711912
Amino.terminal.brain.natriuretic.peptide.precursor.NT.proBNP. -0.7495679
Hypersensitive.cardiac.troponinI -0.7492457
D.D.dimer -0.7466308
neutrophils.count -0.7126498
Prothrombin.time -0.6907276
International.standard.ratio -0.6902048
Interleukin.6 -0.6681017
eosinophils… 0.6716401
monocytes… 0.6733578
calcium 0.6768312
Prothrombin.activity 0.6864320
lymphocyte.count 0.7353506
albumin 0.7408062
X…lymphocyte 0.7916369

5.3. Specific parameters

According to the article there are three parameters that show signigficant correlation with the outcome. The further analysis is going to focus on them:

  • Lactate.dehydrogenase
  • lymphocyte.count
  • High.sensitivity.C.reactive.protein

5.4. Analysis of the 3 parameters

selected_parameters <- c("Lactate.dehydrogenase", "lymphocyte.count", "High.sensitivity.C.reactive.protein")
knitr::kable(summary(GroupedByPatientDF[ , selected_parameters]))
Lactate.dehydrogenase lymphocyte.count High.sensitivity.C.reactive.protein
Min. : 116.0 Min. : 0.0250 Min. : 0.10
1st Qu.: 224.0 1st Qu.: 0.5463 1st Qu.: 10.20
Median : 306.4 Median : 0.9137 Median : 49.00
Mean : 462.1 Mean : 1.1067 Mean : 70.41
3rd Qu.: 610.2 3rd Qu.: 1.3717 3rd Qu.:118.10
Max. :1867.0 Max. :43.0550 Max. :320.00
NA’s :19 NA’s :19 NA’s :22

5.5. Correlation between the 3 parameters and outcome. What about age?

Below analysis shows the correlations between the 3 parameters and likelihood of survival. What is interesting, age itself does not show very strong correlation with outcome (since outbreak of COVID19 higher age has always been treated as a potential threat for patient). Below chart shows also the dominant number of patients between 60-70 in comparison to other age groups:

ggplot(data=GroupedByPatientDF, aes(age)) + 
  geom_histogram(bins = 10)

  selected_parameters <- c("outcome", "Lactate.dehydrogenase", "lymphocyte.count", "High.sensitivity.C.reactive.protein", "age")
  my_correlation(selected_parameters)

  knitr::kable(mtx_corr)
Lactate.dehydrogenase lymphocyte.count High.sensitivity.C.reactive.protein age
outcome -0.8151941 0.7353506 -0.7711912 -0.5688989

5.6. In-depth analysis of the 3 parameters and age

legend_colors <- c("died" = "#000000","survived" = "#119826")

ggplotly(
ggplot(GroupedByPatientDF, aes(x=age, y=Lactate.dehydrogenase ,col =outcome)) + geom_point() + scale_color_manual(values =  legend_colors) + ggtitle("Lactate.dehydrogenase vs age")
)

The higher the level of Lactate.dehydrogenase, the higher the probability of death, majority of occurencies happened to older patients.

ggplotly(
ggplot(GroupedByPatientDF, aes(x=age, y=lymphocyte.count ,col =outcome)) + geom_point() + scale_color_manual(values =  legend_colors) + ylim (0, 5) + ggtitle("lymphocyte.count vs age")
)

The lower the level of lymphocyte.count, the higher the probability of death, majority of occurencies happened to older patients.

ggplotly(
ggplot(GroupedByPatientDF, aes(x=age, y=High.sensitivity.C.reactive.protein ,col =outcome)) + geom_point() + scale_color_manual(values =  legend_colors) + ggtitle("High.sensitivity.C.reactive.protein vs age")
)

The higher the level of High.sensitivity.C.reactive.protein, the higher the probability of death, majority of occurencies happened to older patients.

6. Classification model

6.1. Preparing dataset

selected_parameters <- c("outcome", "Lactate.dehydrogenase", "lymphocyte.count", "High.sensitivity.C.reactive.protein")

ClassificationDF <- GroupedByPatientDF[ , c(selected_parameters)]

ClassificationDF <- ClassificationDF %>% filter(!is.nan(Lactate.dehydrogenase))
ClassificationDF <- ClassificationDF %>% filter(!is.nan(lymphocyte.count))
ClassificationDF <- ClassificationDF %>% filter(!is.nan(High.sensitivity.C.reactive.protein))

6.2. Dividing dataset

set.seed(23)
inTraining <- createDataPartition(
y = ClassificationDF$outcome,
p = .75,
list = FALSE)

training <- ClassificationDF[ inTraining,]
testing  <- ClassificationDF[-inTraining,]

6.3. Schema

ctrl <- trainControl(
method = "repeatedcv",
number = 2,
repeats = 5)

6.4. Learning

fit <- train(outcome ~ .,
data = training,
method = "rf",
trControl = ctrl,
ntree = 10)
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
fit
## Random Forest 
## 
## 264 samples
##   3 predictor
##   2 classes: 'died', 'survived' 
## 
## No pre-processing
## Resampling: Cross-Validated (2 fold, repeated 5 times) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9378788  0.8748940
##   3     0.9333333  0.8655161
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

6.5. Prediction and parameter optimization

rfClasses <- predict(fit, newdata = testing)
confusionMatrix(table(rfClasses, testing$outcome))
## Confusion Matrix and Statistics
## 
##           
## rfClasses  died survived
##   died       37        3
##   survived    2       45
##                                          
##                Accuracy : 0.9425         
##                  95% CI : (0.871, 0.9811)
##     No Information Rate : 0.5517         
##     P-Value [Acc > NIR] : 4.78e-16       
##                                          
##                   Kappa : 0.8841         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9487         
##             Specificity : 0.9375         
##          Pos Pred Value : 0.9250         
##          Neg Pred Value : 0.9574         
##              Prevalence : 0.4483         
##          Detection Rate : 0.4253         
##    Detection Prevalence : 0.4598         
##       Balanced Accuracy : 0.9431         
##                                          
##        'Positive' Class : died           
## 
rfGrid <- expand.grid(mtry = 10:30)
gridCtrl <- trainControl(
method = "repeatedcv",
summaryFunction = twoClassSummary,
classProbs = TRUE,
number = 2,
repeats = 5)

set.seed(23)
fitTune <- train(outcome ~ .,
data = training,
method = "rf",
metric = "ROC",
preProc = c("center", "scale"),
trControl = gridCtrl,
tuneGrid = rfGrid,
ntree = 30)

fitTune
## Random Forest 
## 
## 264 samples
##   3 predictor
##   2 classes: 'died', 'survived' 
## 
## Pre-processing: centered (3), scaled (3) 
## Resampling: Cross-Validated (2 fold, repeated 5 times) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##   10    0.9755903  0.9283333  0.9472222
##   11    0.9769329  0.9266667  0.9472222
##   12    0.9743866  0.9316667  0.9361111
##   13    0.9764005  0.9300000  0.9388889
##   14    0.9764468  0.9283333  0.9430556
##   15    0.9742130  0.9266667  0.9416667
##   16    0.9761921  0.9300000  0.9472222
##   17    0.9738310  0.9266667  0.9444444
##   18    0.9770949  0.9283333  0.9416667
##   19    0.9757870  0.9216667  0.9486111
##   20    0.9728704  0.9283333  0.9444444
##   21    0.9764699  0.9316667  0.9444444
##   22    0.9762500  0.9366667  0.9388889
##   23    0.9767824  0.9333333  0.9458333
##   24    0.9760185  0.9316667  0.9388889
##   25    0.9750926  0.9216667  0.9416667
##   26    0.9755787  0.9266667  0.9472222
##   27    0.9770486  0.9266667  0.9472222
##   28    0.9757407  0.9283333  0.9416667
##   29    0.9752662  0.9266667  0.9402778
##   30    0.9760069  0.9216667  0.9444444
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 18.
ggplot(fitTune) + theme_bw()